library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
fruitfly <- read.csv('fruitfly.csv')
plot(fruitfly$sleep, fruitfly$lifespan)
ggplot() allows us to build up a plot layer by layer.
Put three important features together to draw a graph:
ggplot() begins a plot. It creates a coordinate system that we can
add layers to. Its first argument is the data to use in the graph;
we complete the graph by adding one or more layers to ggplot(data).
geom_XXX() adds a layer of geometric objects to the plot. For
example, geom_point() creates a scatterplot (there are many
different geom functions for different types of graphs).
aes() maps variables to visual properties. Each geom_XXX() takes a
mapping argument, which is always paired with aes(). First, we map
variables to the coordinate system (x and y).
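As a minimal sketch of this layering idea, we can check that ggplot() alone has no layers and that each geom_XXX() call adds one. The sketch uses the mpg data set that ships with ggplot2 rather than the data used in these notes, and the object names base and scatter are only illustrative:

```r
library(ggplot2)

# ggplot() alone: data plus an empty coordinate system, no layers yet
base <- ggplot(mpg)

# Adding a geom creates one layer; aes() maps variables to x and y
scatter <- base +
  geom_point(aes(x = displ, y = hwy))

length(base$layers)     # number of layers before adding a geom
length(scatter$layers)  # number of layers after adding geom_point()
```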
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
First, we want to make a random sample from the diamonds data set:
set.seed(922)
diamonds1 <- diamonds[sample(1:53940, 1000, replace = FALSE), ]
glimpse(diamonds1)
## Rows: 1,000
## Columns: 10
## $ carat <dbl> 2.01, 0.77, 0.31, 1.31, 1.51, 0.75, 0.56, 0.30, 1.01, 0.31, 0.…
## $ cut <ord> Very Good, Ideal, Premium, Premium, Ideal, Very Good, Ideal, I…
## $ color <ord> I, D, H, I, J, H, J, I, J, F, D, G, I, E, I, F, H, F, J, D, E,…
## $ clarity <ord> SI1, SI1, IF, VS2, VS1, SI2, VS2, VVS2, VS2, VS2, SI2, VS1, SI…
## $ depth <dbl> 61.8, 60.8, 60.8, 61.9, 61.9, 63.0, 62.0, 62.0, 63.3, 61.3, 63…
## $ table <dbl> 62, 57, 59, 59, 58, 58, 56, 56, 54, 55, 55, 59, 57, 57, 56, 63…
## $ price <int> 13691, 3251, 739, 6323, 8170, 2180, 1224, 515, 3620, 591, 574,…
## $ x <dbl> 7.98, 5.94, 4.36, 7.02, 7.28, 5.76, 5.27, 4.29, 6.42, 4.35, 4.…
## $ y <dbl> 8.07, 5.90, 4.39, 6.98, 7.33, 5.79, 5.31, 4.32, 6.34, 4.39, 4.…
## $ z <dbl> 4.96, 3.60, 2.66, 4.33, 4.52, 3.64, 3.28, 2.67, 4.04, 2.68, 2.…
ggplot(diamonds1) +
geom_point(aes(x = carat, y = price))
ggplot(diamonds1) +
geom_point(aes(x = carat, y = price)) +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
If we put variables in the first line (inside ggplot()), they are
called global variables. Those variables always affect the entire
plot. In contrast, local variables are the variables we put inside a
geom_XXX() function; they affect only that geom_XXX() layer and not
the other parts of the plot.
In our previous example, there is no difference between making
aes(x = carat, y = price) global variables and making them
local variables.
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point() +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
We can save a plot to an object.
p1 <- ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point() +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
p1
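Because the plot is stored in an object, we can keep adding layers to it later. A small sketch, rebuilding the sample so that it runs on its own (the name p2 is only illustrative):

```r
library(ggplot2)

set.seed(922)
diamonds1 <- diamonds[sample(1:53940, 1000, replace = FALSE), ]

p1 <- ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point() +
  theme_bw()

# Adding to a saved plot returns a new plot; p1 itself is unchanged
p2 <- p1 + geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

length(p1$layers)  # layers in the original plot
length(p2$layers)  # layers after adding the smooth
```

Typing p2 (or print(p2)) at the console draws the extended plot.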
Mapping to size:
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(size = cut)) +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
Mapping to shape:
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(shape = cut)) +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
## Warning: Using shapes for an ordinal variable is not advised
Mapping to color:
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = cut)) +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
We can use multiple mappings:
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(shape = cut, color = cut)) +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
## Warning: Using shapes for an ordinal variable is not advised
Note: map size to numeric variables and shape to categorical variables.
Mapping to alpha (transparency):
ggplot(diamonds1)+
geom_point(aes(x=carat, y= price), alpha=0.1)
ggplot(diamonds1)+
geom_point(aes(x=carat, y= price, alpha=cut))
Note: color and alpha aesthetics can be mapped to either categorical variables or numeric variables.
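The two calls above differ in where alpha appears: outside aes() it is a constant applied to every point, while inside aes() it is mapped from a variable and gets a legend. As a hedged sketch, layer_data() (a ggplot2 helper that returns the data computed for a layer) lets us inspect this; the names p_set and p_map are only illustrative:

```r
library(ggplot2)

set.seed(922)
diamonds1 <- diamonds[sample(1:53940, 1000, replace = FALSE), ]

# alpha set outside aes(): one constant value for all points
p_set <- ggplot(diamonds1) +
  geom_point(aes(x = carat, y = price), alpha = 0.1)

# alpha mapped inside aes(): the value depends on cut
# (ggplot2 warns that alpha for a discrete variable is not advised)
p_map <- ggplot(diamonds1) +
  geom_point(aes(x = carat, y = price, alpha = cut))

unique(layer_data(p_set)$alpha)          # a single constant value
length(unique(layer_data(p_map)$alpha))  # several values, one per cut level present
```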
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, formula = "y~x")
In the example above, aes(x = carat, y = price) is a global
mapping, so we do not need to repeat it every time a layer uses
those variables. However, if we make the variables local, we need to
repeat the mapping in every layer:
ggplot(diamonds1) +
geom_point(aes(x = carat, y = price)) +
geom_smooth(aes(x = carat, y = price), method = "lm", se = FALSE, formula = "y~x")
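The choice matters once a mapping should apply to only some layers. In the sketch below (names p_local and p_global are illustrative), a local color mapping affects only the points, while a global one is inherited by geom_smooth() as well and splits the fit by group:

```r
library(ggplot2)

set.seed(922)
diamonds1 <- diamonds[sample(1:53940, 1000, replace = FALSE), ]

# Local: color applies only to the points, so there is one fitted line
p_local <- ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Global: every layer inherits color = cut, so the line is fitted per cut
p_global <- ggplot(diamonds1, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Count the groups used by the smooth layer (layer 2) in each plot
n_local  <- length(unique(layer_data(p_local, 2)$group))
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local
n_global
```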
Example: R has a built-in data set called
mtcars, and here is a preview of it:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Make a scatter plot of mpg against wt, mapping
color to cyl and shape to am.
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), shape = factor(am))) +
geom_point()
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(alpha = 0.2, size = 3, shape = 6, color = "red") + # or use number
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(alpha = 0.2, size = 3, shape = 6, color = rgb(0.5, 0.7, 0.2)) +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_bw()
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(alpha = 0.2, size = 3, shape = 6, color = "#012169") +
labs(x = "Carat", y = "Price", title = "Figure 1:diamonds price by carat") +
theme_minimal()
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = carat))
If we want to change the color scale, we use the
scale_color_gradient() function.
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = carat)) +
scale_color_gradient(name = "Carat", low = "darkblue", high = "orange")
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = clarity)) +
scale_color_discrete(name = "Clarity")
If we want to change the colors manually, we first need to know how many levels the variable has.
levels(diamonds$clarity)
## [1] "I1" "SI2" "SI1" "VS2" "VS1" "VVS2" "VVS1" "IF"
Now, we use scale_color_manual() to change the color
scale, supplying one color per level.
ggplot(diamonds1, aes(x=carat, y=price))+
geom_point(aes(color=clarity))+
scale_color_manual(name="Clarity Title", values=c("red", "darkblue","darkgreen", "grey", "grey3", "black","darkred","darkorange"))
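A possible variation, sketched under the assumption that we want each color tied to a specific level rather than to its position in the vector: give scale_color_manual() a named vector. The name clarity_cols is illustrative and the colors are the same arbitrary ones as above.

```r
library(ggplot2)

set.seed(922)
diamonds1 <- diamonds[sample(1:53940, 1000, replace = FALSE), ]

# One color per level; naming the vector ties each color to a level
# explicitly instead of relying on the order of the values
clarity_cols <- setNames(
  c("red", "darkblue", "darkgreen", "grey", "grey3",
    "black", "darkred", "darkorange"),
  levels(diamonds$clarity)
)

p <- ggplot(diamonds1, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity)) +
  scale_color_manual(name = "Clarity", values = clarity_cols)

clarity_cols["IF"]  # look up the color assigned to one level
```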
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(size = carat)) +
scale_size(name = "Carat Size", range = c(3, 8)) + ## edit legend about size
labs(x = "carat", y = "price", title = "diamonds price by carat")
size is for quantitative variables (numeric vector)
is.factor(diamonds1$cut)
## [1] TRUE
Since cut is a discrete variable, we can use it for
shape.
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(shape = cut),color = "blue") +
scale_shape(name = "Cut Types") + # edit legend about shape
labs(x= "carat", y = "price", title = "diamonds price by carat")
shape is for categorical variables (factor)
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(alpha = price), color = "blue", position = 'jitter') +
scale_alpha(name = "Price") #edit legend of opacity
alpha works best for continuous variables (numeric vectors).
Note: jitter randomly shifts the position of each data point
slightly so that overlapping points become visible.
ggplot(mtcars, aes(x = wt, y = am)) +
geom_point(alpha = 0.1)
Some points are darker than others, meaning that data points
overlap there. Try position = "jitter" to separate the
overlapping points.
ggplot(mtcars, aes(x = wt, y = am)) +
geom_point(position = "jitter")
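Note: The resulting plot is misleading. By default, position = "jitter" adds random noise in both directions, so the points no longer sit exactly at am = 0 and am = 1, and their wt values are shifted as well.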
Or we can directly use the geom_jitter() function. In
the geom_jitter() function, we can specify the settings of
jitter.
ggplot(mtcars, aes(x = wt, y = am)) +
geom_jitter(width = 0.5, height = 0.01)
Another way is to use position = position_jitter().
Using this argument in the geom_point() function creates an
equivalent plot because we can specify the same jitter settings
inside position_jitter() (the exact positions differ between runs
because the jitter is random).
ggplot(mtcars, aes(x = wt, y = am)) +
geom_point(position = position_jitter(width = 0.5, height = 0.01))
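One hedged way to make a jittered plot reproducible and less distorting is to jitter only along the axis where overlap is the problem and to fix the random seed. The sketch below assumes we want wt to stay exact and only spread the two am bands slightly; the amounts 0 and 0.1 and the seed value are arbitrary choices, not settings from these notes.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = am)) +
  geom_point(
    alpha = 0.5,
    position = position_jitter(width = 0, height = 0.1, seed = 1)
  )

jittered <- layer_data(p)

# x is untouched because width = 0
all.equal(sort(jittered$x), sort(mtcars$wt))

# y moves away from 0 or 1 by no more than the chosen height
max(abs(jittered$y - round(jittered$y)))
```

With a fixed seed, rebuilding the plot gives the same jittered positions each time.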
The facet functions split one plot into multiple subplots
(panels), one for each level of the faceting variable(s).
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = clarity)) +
facet_grid(. ~ cut)
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = clarity)) +
facet_grid(clarity ~ cut)
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = clarity)) +
facet_wrap(. ~ cut)
ggplot(diamonds1, aes(x = carat, y = price)) +
geom_point(aes(color = clarity)) +
facet_wrap(clarity ~ cut)
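One difference between the two functions: facet_grid() lays out every row-by-column combination, even combinations with no data, while facet_wrap() only draws the combinations that occur. A rough sketch on mtcars (chosen here only because it is small; reading ggplot_build(p)$layout$layout is just one way to count panels, and the object names are illustrative):

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

p_grid <- base + facet_grid(cyl ~ gear)
p_wrap <- base + facet_wrap(cyl ~ gear)

# The computed layout has one row per panel
n_grid <- nrow(ggplot_build(p_grid)$layout$layout)
n_wrap <- nrow(ggplot_build(p_wrap)$layout$layout)

n_grid  # number of cyl values times number of gear values
n_wrap  # only the cyl/gear combinations present in the data
```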
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 0.7, size = 7, position = "jitter", aes(color = cty)) +
geom_text(aes(label = factor(cyl)),
color = "white", size = 3,
vjust = 1.5, position = position_dodge(0.3),
check_overlap = TRUE)
ggplot(diamonds1, aes(x = cut)) +
geom_bar()
ggplot(diamonds1, aes(x = cut)) +
geom_bar(aes(fill = clarity))
ggplot(diamonds1, aes(x = cut)) +
geom_bar(aes(fill = clarity), position = "dodge")
Using the stat_count() function instead of geom_bar(), we get
exactly the same plot. In ggplot2, each geom has a default
statistical transformation (stat); for geom_bar() it is stat_count().
ggplot(diamonds1, aes(x = cut)) +
stat_count(aes(fill = clarity), position = "dodge")
ggplot(diamonds1, aes(x = cut, y = price)) +
geom_boxplot(aes(col = clarity))
ggplot(diamonds1, aes(price)) +
geom_histogram(bins = 10)
ggplot(diamonds1, aes(price)) +
geom_density(aes(color = clarity))
?economics
ggplot(economics, aes(x = date, y = unemploy)) +
geom_line()
ggplot(diamonds1, aes(x=cut))+
geom_bar()
Each geom has a default statistical transformation (stat); for geom_bar() it is stat_count().
The previous code is therefore the same as the following:
ggplot(diamonds1, aes(x=cut))+
stat_count()
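To check this equivalence rather than compare the plots by eye, a small sketch can compare the counts each layer computes. It uses the full diamonds data so the result does not depend on the random sample; the object names are illustrative.

```r
library(ggplot2)

p_bar  <- ggplot(diamonds, aes(x = cut)) + geom_bar()
p_stat <- ggplot(diamonds, aes(x = cut)) + stat_count()

# layer_data() returns the values each layer computed, including count
bar_counts  <- layer_data(p_bar)$count
stat_counts <- layer_data(p_stat)$count

identical(bar_counts, stat_counts)  # both layers compute the same counts
table(diamonds$cut)                 # the same counts, computed directly
```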
prop.table(table(diamonds$cut))
##
## Fair Good Very Good Premium Ideal
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
demo <- tribble(
~cut, ~prop,
"Fair", 0.0298,
"Good", 0.0909,
"Very Good", 0.2239,
"Premium", 0.2555,
"Ideal", 0.3995
)
If we run
ggplot(demo, aes(x = cut, y = prop)) +
geom_bar()
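in R, we will get an error because geom_bar() uses stat_count() by
default, which counts the rows itself and so accepts an x or a y
aesthetic, but not both. Here, we need to add a stat = "identity"
argument in geom_bar() so that the bar heights are taken directly
from prop: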
ggplot(demo, aes(x = cut, y = prop)) +
geom_bar(stat = "identity")
The previous code is equivalent to the following:
ggplot(demo, aes(x = cut, y = prop)) +
geom_col()
geom_col() expects both an x and a y variable and uses
stat = "identity" by default, so the previous code runs without any
error or additional arguments.
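The three variants can be compared in one short sketch. It rebuilds a table like demo from diamonds instead of typing the proportions by hand, and stores cut as a factor so the bars keep the Fair-to-Ideal order (a character column would be sorted alphabetically); both choices are assumptions of this sketch, not part of the notes above.

```r
library(ggplot2)

props <- prop.table(table(diamonds$cut))
demo <- data.frame(
  cut  = factor(names(props), levels = names(props)),
  prop = as.numeric(props)
)

# geom_bar() with both x and y fails when the plot is built
bar_result <- tryCatch(
  {
    ggplot_build(ggplot(demo, aes(x = cut, y = prop)) + geom_bar())
    "built"
  },
  error = function(e) "error"
)

p_identity <- ggplot(demo, aes(x = cut, y = prop)) +
  geom_bar(stat = "identity")
p_col <- ggplot(demo, aes(x = cut, y = prop)) +
  geom_col()

bar_result
all.equal(layer_data(p_identity)$y, layer_data(p_col)$y)
```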
We can use coord_flip() to flip the coordinates, swapping the x and y axes.
ggplot(demo, aes(x = cut, y = prop)) +
geom_col() +
coord_flip()